Plugin interface for backends #570
base: master
Conversation
Setting the tensor buffer to … I think this can be handled by using functions to create custom buffers, such as …
So, wouldn't this mean a separate buffer (and an allocation for the buffer data) for each external tensor? Note that the use case is changing the tensor's external data pointer often: input/output tensors need to be reset for every inference call.
Why not create a dummy buffer with size zero, which can be used for all external tensors for a given backend? Thus the backend will be accessible, but allocating data on this buffer will not be possible. Moreover, a query of whether a tensor is external will also be possible.
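For illustration, a minimal sketch of what such a zero-size dummy buffer could look like, assuming the ggml-backend buffer interface of this period (`ggml_backend_buffer_i` and `ggml_backend_buffer_init`); everything else here is hypothetical, not code from this PR:

```c
// Sketch: a zero-size buffer whose only job is to mark tensors as external
// and to keep them associated with a backend. Nothing can be allocated from it.
static void   dummy_free_buffer(ggml_backend_buffer_t buffer) { (void) buffer; }
static void * dummy_get_base   (ggml_backend_buffer_t buffer) { (void) buffer; return NULL; }

static struct ggml_backend_buffer_i dummy_buffer_i = {
    /* .free_buffer    = */ dummy_free_buffer,
    /* .get_base       = */ dummy_get_base,
    /* .get_alloc_size = */ NULL, // never called: the buffer has size zero
    /* .init_tensor    = */ NULL,
    /* .free_tensor    = */ NULL,
};

// One dummy buffer per backend; tensor->buffer pointing at it means "external".
static ggml_backend_buffer_t make_dummy_external_buffer(ggml_backend_t backend) {
    return ggml_backend_buffer_init(backend, dummy_buffer_i, /*context=*/NULL, /*size=*/0);
}
```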
Yes, I think this is fine. Just make a custom type of buffer that returns …
@slaren, how's this?
We cannot just add a dummy buffer to the backends; this may not work in every backend. For the CUDA backend, it should work like this:

```c
ggml_backend_buffer_t buffer = ggml_backend_cuda_buffer_from_ptr(NULL, 0);
ggml_backend_init_tensor(buffer, tensor, data);

void ggml_backend_init_tensor(ggml_backend_buffer_t buffer, struct ggml_tensor * tensor, void * data) {
    tensor->data   = data;
    tensor->buffer = buffer;
    ggml_backend_buffer_init_tensor(buffer, tensor);
}
```

After this, modify …
Why not? It's a dummy buffer anyway. If for some reason a backend does not support external pointers, then it should have its … This is much simpler than copying code for the dummy buffer to each backend separately.
The key part here is that the extra is reused if available and …
Some backends have buffer objects and cannot represent a memory address with just a pointer. This is not so uncommon either: Metal, OpenCL, and Vulkan work in this way. So a general-purpose dummy buffer is completely useless, because it needs to be associated with the backend buffer object. It is not good to have a dummy object there that may be useless depending on the backend. Additionally, the code of … This needs to be handled on a per-backend basis, and the buffer interface is flexible enough to allow this. The CUDA backend needs a different extra for each tensor; the CUDA buffer implementation keeps its own ring buffer of extras. It is not OK to reuse the global ring buffer either; that and other globals in the CUDA backend will be removed in the future.
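For context, the per-buffer ring of extras mentioned above could be pictured like this; a rough sketch with illustrative names and capacity, not the actual ggml-cuda code:

```c
#include <string.h>

#define MAX_TENSOR_EXTRAS 4096 // illustrative capacity

// Per-buffer pool of tensor extras, handed out round-robin so that every
// tensor initialized from this buffer gets its own extra, with no globals.
struct cuda_buffer_context {
    struct ggml_tensor_extra_gpu extras[MAX_TENSOR_EXTRAS];
    size_t next;
};

static struct ggml_tensor_extra_gpu * alloc_tensor_extra(struct cuda_buffer_context * ctx) {
    struct ggml_tensor_extra_gpu * extra = &ctx->extras[ctx->next];
    ctx->next = (ctx->next + 1) % MAX_TENSOR_EXTRAS;
    memset(extra, 0, sizeof(*extra));
    return extra;
}
```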
I think there is some misunderstanding here. The dummy buffer is just a dummy. It should never be used to actually allocate. It's there just to signify that the tensor is external and to allow … Moreover, … The workflow for a plugin should be like this: … This means that … We don't want to rebuild the graph with new tensors for every inference step, so we just reset their data pointers.
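In rough (hypothetical) code, assuming the `ggml_backend_set_tensor_external_data` function this PR proposes; `plugin_infer`, `build_model_graph`, and the tensor names are illustrative:

```c
// Build the graph once, at plugin initialization; the external input/output
// tensors get no ggml-owned allocation.
struct ggml_cgraph * graph = build_model_graph(ctx);

// Per inference call: point the external tensors at the host framework's
// device memory, then compute. No graph rebuild is needed.
void plugin_infer(ggml_backend_t backend, void * in_ptr, void * out_ptr) {
    ggml_backend_set_tensor_external_data(backend, input_tensor, in_ptr);
    ggml_backend_set_tensor_external_data(backend, output_tensor, out_ptr);
    ggml_backend_graph_compute(backend, graph);
}
```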
The problem is that some backends cannot use dummy buffers; they need a buffer object in addition to an offset. For these tensors, the dummy buffer will never be useful. If you wanted to use an external buffer with these backends, you would need a new function to create a … You can modify the implementation …
Wouldn't potential extra data, like buffer objects, be stored in that backend's …? I still don't quite get what you mean. Say I have an OpenGL backend:

```c
GLuint vbo = ...;
ggml_backend_set_tensor_external_data(glbackend, mytensor, (void*)vbo);
...

static void ggml_backend_opengl_set_tensor_external_data(ggml_backend_t backend, struct ggml_tensor * tensor, void * data) {
    ...
    tensor->buffer = &backend->dummy_external_tensor_buffer;
    extra->vbo = (GLuint)data;
    extra->offset = 0;
}
```

The dummy buffer is still usable. Its only point is to allow …
Though it does make sense to have …
I found a problem with the dummy buffer for CUDA: since it is impossible to associate something with the lifetime of a tensor, it is impossible to manage the extras properly. I'll try to think of a solution.
OK, so the main problem is views. View pointers are set when the compute graph is built. Obviously, if there are views to external tensors which are then redirected, this will lead to bugs and crashes. While I have some ideas of how to make this work, I won't be adding them to this PR. Instead I'll go for the simpler solution: use an allocator to set external pointers, which means: … After this makes it into ggml, I will open an issue to discuss potential improvements. Essentially a rewrite is incoming :)
@slaren, so this is the current version. The limitations are as mentioned above. Example usage is added. I talked with @ggerganov and there is no Metal support yet. It's not that easy, though we have some ideas of how to handle it. Metal support can be addressed in the future if there's interest. The odds of having demand for Metal-based plugins are minimal anyway.
The changes in ggml are quite minimal and isolated, so I don't think there would be an issue merging this. Adding a CI job to run the plugin example would be useful for long-term support of this functionality.
I have already explained how I intend to implement support for external tensors. ggml-backend is a work in progress and, frankly, it is quite annoying to be ignored about changes to a design that I am still working on. So my conclusion: it is too early to merge changes to ggml-backend. If you want to implement this change, please do it on the old ggml-cuda interface.
@slaren That's a fair point. You have the final call for this PR, and if you think it interferes with the design of the interface, we will not merge it at this point. The change is trivial, so I don't think it would be a problem for @iboB to maintain it in a separate branch until (and if) support for this sort of functionality becomes available in the future.
It's trivial now, but it won't be trivial with any of these features: … I still think that the ability to create plugins for existing inference apps (built with PyTorch, TensorFlow, cuDNN, etc.) is very valuable.
BTW, @slaren, how do you intend to implement external data for tensors?
Probably in the way that I suggested earlier, but nothing is set in stone. I will get back to this while implementing mmap support for llama.cpp. As for the views, we can add a function to ggml-backend to re-initialize a view, but I don't think this can be automated, since there isn't any consistent way to enumerate all views of a tensor in ggml at the moment. The CUDA extras will also very likely be removed, except for multi-GPU split tensors, so that may simplify some things a bit.
If split tensor ops are eliminated, the CUDA extras can easily be eliminated as well; otherwise it doesn't seem possible. An alternative approach for views would be to reinitialize them when computing the graph, as opposed to when building it. This is cheap and the additional overhead would be negligible.
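A sketch of what per-compute view reinitialization could look like, assuming views carry `view_src`/`view_offs` as in ggml's tensor struct; `reinit_view` is a hypothetical helper, not part of this PR:

```c
// Re-derive a view's buffer and data pointer from its (possibly redirected)
// source tensor while walking the graph at compute time.
static void reinit_view(struct ggml_tensor * view) {
    struct ggml_tensor * src = view->view_src;
    view->buffer = src->buffer;
    view->data   = (char *) src->data + view->view_offs;
}
```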
This implements a proposal to add a plugin interface to backends.
There is a demo project which uses the changes from this PR here: https://github.com/iboB/pytorch-ggml-plugin
Notable changes

- Add `set_tensor_external_data` to the backend interface. This is as simple as `tensor->data = external_data_ptr` on CPU and Metal, but requires more steps on CUDA and potentially other backends (creating tensor extra data).
- Add `ggml_backend_cuda_init_plugin`. This allows us to initialize the CUDA backend with an external device id, cuBLAS instance, and CUDA stream.
- Add a `GGML_PLUGIN` config to CMakeLists.txt. This is explicitly not a CMake `option`; it should be set to `ON` from the outside when one calls `add_subdirectory` on ggml for a plugin (see the sketch after this list). It forces a static library build and adds `-fPIC`.
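As a sketch, a plugin's own build could consume ggml roughly like this (paths and target names are illustrative):

```cmake
# Request the plugin-friendly configuration before pulling in ggml:
# a static library compiled with -fPIC.
set(GGML_PLUGIN ON)
add_subdirectory(third-party/ggml)

add_library(my_ggml_plugin SHARED plugin.cpp)
target_link_libraries(my_ggml_plugin PRIVATE ggml)
```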
Questions

Tensors whose pointer is set externally with `set_tensor_external_data` have no buffer. Thus `ggml_get_backend` and the associated ops `tensor_get/set` will simply crash when called on them.

We could leave it as it is, documenting that these ops are not supported for external tensors and that such tensors have null buffers, or we could add a dummy buffer with the correct backend to them (which may be tricky if one wants to allocate memory with it).

What about something else?